Replace urlsplit with a new parser #333

irene-sheen-reef · 2025-07-22T05:43:02Z

Previously, urllib.parse.urlsplit was used to parse the URL. It splits the filename/path across the path, query and fragment components, so in several cases the parsed filename differs from the one the user intended, for example:

files starting with /
files ending with ? or #

Replacing it with a simpler parser that returns only (scheme, netloc, path) solves these issues.
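A minimal sketch of the splitting described above, using only the standard library. The bucket and file names are made up for illustration, and `split_parts` is a hypothetical helper, not part of the PR. The leading-slash case depends on how the caller post-processes the path, so it is not reproduced here.

```python
from urllib.parse import urlsplit


def split_parts(uri):
    """Return (path, query, fragment) as urlsplit sees them."""
    parts = urlsplit(uri)
    return parts.path, parts.query, parts.fragment


# Hypothetical URIs: '?' and '#' are meant to be part of the file name,
# but urlsplit treats them as query and fragment delimiters.
for uri in ("b2://bucket/report?", "b2://bucket/notes#1.txt", "b2://bucket/a?b.txt"):
    print(uri, "->", split_parts(uri))
```

In each case the path no longer contains the full file name, so code that reads only `.path` sees a different name from the one the user typed.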

olzhasar-reef · 2025-07-22T21:15:06Z

Overall, the approach you've chosen does the job.
It seems to closely mimic the behavior of urllib.parse.urlsplit (in a simplified manner, as you've pointed out). This can be a good thing where a generalized solution that preserves the original behavior is actually needed. However, I am not sure we have such a case here.
As we essentially need to support only two kinds of URLs, b2id:// and b2://bucket/file, is there a simpler way to achieve that with fewer abstractions and less code-induced liability?

irene-sheen-reef · 2025-07-23T06:04:38Z

@olzhasar-reef I replaced the SplitB2Result with a simple namedtuple, which should be sufficient. I considered hardcoding the schemes into the regex, but that would produce less useful error messages ("Invalid B2 URI" instead of "Unsupported URI scheme"), so I would prefer to leave that as is.
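A rough sketch of the trade-off described above, not the PR's actual code. The names `SplitResult`, `split_uri`, `_URI_RE` and `SUPPORTED_SCHEMES`, and the exact regex, are assumptions for illustration. Keeping the scheme generic in the regex lets the parser distinguish a malformed URI from a well-formed one with an unsupported scheme.

```python
import re
from collections import namedtuple

# Hypothetical stand-in for the namedtuple mentioned in the comment.
SplitResult = namedtuple("SplitResult", "scheme netloc path")

# Assumed pattern: any scheme is accepted here and validated afterwards.
_URI_RE = re.compile(r"(?P<scheme>[a-z][a-z0-9+.-]*)://(?P<netloc>[^/]*)(?P<path>/.*)?")

SUPPORTED_SCHEMES = {"b2", "b2id"}


def split_uri(uri):
    match = _URI_RE.fullmatch(uri)
    if match is None:
        raise ValueError(f"Invalid B2 URI: {uri!r}")
    scheme = match["scheme"]
    if scheme not in SUPPORTED_SCHEMES:
        # Only reachable because the scheme is not hardcoded in the regex.
        raise ValueError(f"Unsupported URI scheme: {scheme!r}")
    # '?' and '#' stay in the path because nothing splits on them.
    return SplitResult(scheme, match["netloc"], match["path"] or "")
```

If the schemes were hardcoded into the pattern, an `https://` URI would simply fail to match and could only be reported as invalid.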

olzhasar-reef · 2025-07-23T21:38:45Z

@irene-sheen-reef
My point is that this whole URL split and unsplit flow feels like overkill for our use case.
As scheme, netloc and path are just intermediate values, we might as well parse exactly what's needed directly.

E.g.:

b2id_pattern = re.compile(r'^b2id://(?P[a-zA-Z0-9:_-]+)$')
b2_pattern = re.compile(r'^b2://(?P[a-z0-9-]+)/(?P.+)$')

And pick the appropriate one based on the prefix.

Are there any downsides compared to the approach you've suggested?
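The suggestion above could be sketched roughly as follows. The group names (`file_id`, `bucket`, `path`) are placeholders, since the names are not visible in the quoted patterns, and `parse_b2_uri` and its return shape are assumptions for illustration.

```python
import re

# Character classes are taken from the comment; group names are placeholders.
B2ID_PATTERN = re.compile(r"^b2id://(?P<file_id>[a-zA-Z0-9:_-]+)$")
B2_PATTERN = re.compile(r"^b2://(?P<bucket>[a-z0-9-]+)/(?P<path>.+)$")


def parse_b2_uri(uri):
    """Pick a pattern by prefix and return a tuple describing the target."""
    if uri.startswith("b2id://"):
        match = B2ID_PATTERN.match(uri)
        if match:
            return ("b2id", match["file_id"])
    elif uri.startswith("b2://"):
        match = B2_PATTERN.match(uri)
        if match:
            return ("b2", match["bucket"], match["path"])
    raise ValueError(f"Invalid B2 URI: {uri!r}")
```

With this shape, everything after the first slash following the bucket is kept verbatim as the file name. As written, the second pattern requires a non-empty path, so a bare `b2://bucket` would be rejected.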

b2/_internal/_utils/uri.py

Replace urlsplit with a new parser

4b41a7d

irene-sheen-reef force-pushed the new-parser branch from 6c54c51 to 4b41a7d · July 22, 2025 05:46

Replace class SplitB2Result with a namedtuple

33b1725

Remove uriparse.b2_urlsplit and simplify the code

2c0af93

irene-sheen-reef force-pushed the new-parser branch from 1b04d9d to 2c0af93 · July 29, 2025 11:40

olzhasar-reef requested changes Jul 29, 2025

b2/_internal/_utils/uri.py: 4 review comments (all resolved, 3 outdated)

Handle path in parse_uri instead of _parse_b2_uri

5c0e518

irene-sheen-reef force-pushed the new-parser branch from 444b7eb to 5c0e518 · July 30, 2025 09:51

irene-sheen-reef requested a review from olzhasar-reef July 30, 2025 09:53

olzhasar-reef requested changes Jul 30, 2025

b2/_internal/_utils/uri.py: 1 review comment (resolved, outdated)

Move URI cleaning into _clean_uri

95c37db

olzhasar-reef merged commit 5d5b5d3 into reef-technologies:master Jul 30, 2025
30 checks passed
